1 Lesson 5


1.0.1 Scatterplots and Perceived Audience Size

Notes: The points on the scatter plot are just the big cluster at the bottom


1.0.2 Scatterplots

Notes:

library(ggplot2)
pf <- read.delim("pseudo_facebook.tsv")

qplot(age, friend_count, data = pf)

ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point()


1.0.2.1 What are some things that you notice right away?

Response: It looks like younger users have a lot of friends. There are some vertical bars at ages where people have likely lied about their age, such as 69 and around 100. Those users are probably teenagers or perhaps fake accounts, given their really high friend counts.


1.0.3 ggplot Syntax

Notes:

summary(pf$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   28.00   37.28   50.00  113.00
ggplot(aes(x = age, y = friend_count), data = pf) + 
  geom_point() + 
  xlim(13, 90)
## Warning: Removed 4906 rows containing missing values (geom_point).


1.0.4 Overplotting

Notes:

ggplot(aes(x = age, y = friend_count), data = pf) + 
  geom_jitter(alpha = 0.05) + 
  xlim(13, 90)
## Warning: Removed 5188 rows containing missing values (geom_point).

1.0.4.1 What do you notice in the plot?

Response: With this new plot, we can see that the friend counts for young users aren’t nearly as high as they looked before.


1.0.5 Coord_trans()

Notes:

ggplot(aes(x = age, y = friend_count), data = pf) + 
  geom_point(alpha = 0.05) + 
  xlim(13, 90) +
  coord_trans(y = "sqrt")
## Warning: Removed 4906 rows containing missing values (geom_point).

1.0.5.1 Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

ggplot(aes(x = age, y = friend_count), data = pf) + 
  geom_point(alpha = 0.05, position = position_jitter(height = 0)) + 
  xlim(13, 90) +
  coord_trans(y = "sqrt")
## Warning: Removed 5182 rows containing missing values (geom_point).

1.0.5.2 What do you notice?

We can see the thresholds of friend count above which there are very few users.
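
One reason the jittered plot above sets the vertical jitter to zero is that friend counts of 0 could otherwise be nudged below zero, and the square root of a negative number is not defined. The sketch below illustrates this with toy values; the `vertical_noise` offsets are hypothetical and are not the offsets ggplot2 would actually draw.

```r
# Toy friend counts; the zeros are the values that cause trouble.
friend_count <- c(0, 0, 4, 9)

# Hypothetical vertical jitter offsets (illustrative only).
vertical_noise <- c(-0.3, 0.2, -0.1, 0.3)

# Vertical jitter can push a zero below zero, and sqrt() of a
# negative number is NaN, so that point could not be drawn.
with_vertical_jitter <- suppressWarnings(sqrt(friend_count + vertical_noise))

# Without vertical jitter every value stays non-negative.
without_vertical_jitter <- sqrt(friend_count)

print(with_vertical_jitter)
print(without_vertical_jitter)
```

Jittering only along the age axis keeps the benefit of spreading out the integer ages while leaving every friend count valid under the square root.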


1.0.6 Alpha and Jitter

Notes:

# Examine the relationship between
# friendships_initiated (y) and age (x)
# using the ggplot syntax.

# We recommend creating a basic scatter
# plot first to see what the distribution looks like,
# and then adjusting it by adding one layer at a time.

# What are your observations about your final plot?

# Remember to make adjustments to the breaks
# of the x-axis and to apply alpha and jitter.

# ENTER ALL OF YOUR CODE FOR YOUR PLOT BELOW THIS

ggplot(aes(x = age, y = friendships_initiated), data = pf) +
  geom_point(alpha = 0.05, position = position_jitter(height = 0)) +
  xlim(13, 90) + 
  coord_trans(y = "sqrt")
## Warning: Removed 5181 rows containing missing values (geom_point).


1.0.7 Overplotting and Domain Knowledge

Notes: Percentage transform


1.0.8 Conditional Means

Notes:

# install.packages("dplyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
                          friend_count_mean = mean(friend_count),
                          friend_count_median = median(friend_count),
                          n = n())
pf.fc_by_age <- arrange(pf.fc_by_age, age)
head(pf.fc_by_age)
## # A tibble: 6 x 4
##     age friend_count_mean friend_count_median     n
##   <int>             <dbl>               <dbl> <int>
## 1    13              165.                 74    484
## 2    14              251.                132   1925
## 3    15              348.                161   2618
## 4    16              352.                172.  3086
## 5    17              350.                156   3283
## 6    18              331.                162   5196
pf.fc_by_age <- pf %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age)

head(pf.fc_by_age)
## # A tibble: 6 x 4
##     age friend_count_mean friend_count_median     n
##   <int>             <dbl>               <dbl> <int>
## 1    13              165.                 74    484
## 2    14              251.                132   1925
## 3    15              348.                161   2618
## 4    16              352.                172.  3086
## 5    17              350.                156   3283
## 6    18              331.                162   5196

Create your plot!

# Plot mean friend count vs. age using a line graph.
# Be sure you use the correct variable names
# and the correct data frame. You should be working
# with the new data frame created from the dplyr
# functions. The data frame is called 'pf.fc_by_age'.

# Use geom_line() rather than geom_point to create
# the plot. You can look up the documentation for
# geom_line() to see what it does.

ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
  geom_line()

# There is an odd spike at age 69.
# Young users still have high mean friend counts,
# and for ages between 30 and 60 the mean hovers just above 100.
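
As a quick sanity check on what the grouped summary computes, the same three statistics can be worked out by hand on a tiny data frame. This is only a sketch in base R: the `toy` data frame and its values are made up for illustration and are not taken from `pf`.

```r
# Made-up data: two users aged 13 and three users aged 14.
toy <- data.frame(
  age = c(13, 13, 14, 14, 14),
  friend_count = c(10, 30, 0, 60, 90)
)

# Compute each statistic per age group with tapply().
means   <- tapply(toy$friend_count, toy$age, mean)
medians <- tapply(toy$friend_count, toy$age, median)
counts  <- tapply(toy$friend_count, toy$age, length)

# Assemble a summary table with the same column names as pf.fc_by_age.
toy_by_age <- data.frame(
  age = as.numeric(names(means)),
  friend_count_mean = as.vector(means),
  friend_count_median = as.vector(medians),
  n = as.vector(counts)
)

print(toy_by_age)
```

For age 14 the mean (50) and the median (60) differ, which is a small reminder of why it can be worth keeping both columns.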

1.0.9 Overlaying Summaries with Raw Data

Notes:

ggplot(aes(x = age, y = friend_count), data = pf) +
  xlim(13, 90) +
  geom_point(alpha = 0.05,
             position = position_jitter(height = 0),
             color = "orange") +
  coord_trans(y = "sqrt") +
  geom_line(stat = "summary", fun.y = mean) +
  geom_line(stat = "summary", fun.y = quantile, 
            fun.args = list(probs = .1), linetype = 2, color = "blue") +
  geom_line(stat = "summary", fun.y = quantile, 
            fun.args = list(probs = .5), linetype = 2, color = "blue") +
  geom_line(stat = "summary", fun.y = quantile, 
            fun.args = list(probs = .9), linetype = 2, color = "blue") 
## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 5185 rows containing missing values (geom_point).

ggplot(aes(x = age, y = friend_count), data = pf) +
  coord_cartesian(xlim = c(13, 70), ylim = c(0, 1000)) +
  geom_point(alpha = 0.05,
             position = position_jitter(height = 0),
             color = "orange") +
  geom_line(stat = "summary", fun.y = mean) +
  geom_line(stat = "summary", fun.y = quantile, 
            fun.args = list(probs = .1), linetype = 2, color = "blue") +
  geom_line(stat = "summary", fun.y = quantile, 
            fun.args = list(probs = .5), linetype = 2, color = "blue") +
  geom_line(stat = "summary", fun.y = quantile, 
            fun.args = list(probs = .9), linetype = 2, color = "blue") 

1.0.9.1 What are some of your observations of the plot?

Response: We can see that for users aged 35 to 60, the 90th percentile of friend count falls below 250. So 90% of the users in this age group have fewer than 250 friends.
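
To make the reading of the 90% line concrete, here is a small sketch with made-up friend counts for a single hypothetical age group. It uses the default behavior of `quantile()`; the numbers are illustrative and not from `pf`.

```r
# Made-up friend counts for ten users of the same age.
friend_count <- c(5, 20, 40, 60, 80, 100, 150, 200, 240, 900)

# The 90th percentile: the value the upper dashed line would pass through.
q90 <- quantile(friend_count, probs = 0.9)

# Share of users at or below that value.
share_below <- mean(friend_count <= q90)

print(q90)
print(share_below)
```

Nine of the ten users sit at or below the 90th percentile, and the single very large count (900) pulls the mean up far more than it moves the quantile lines.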


1.0.10 Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes: pass


1.0.11 Correlation

Notes:

cor.test(pf$age, pf$friend_count, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737
with(pf, cor.test(age, friend_count, method = "pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737

Look up the documentation for the cor.test function.

What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027 (no meaningful linear relationship)


1.0.12 Correlation on Subsets

Notes:

with(subset(pf, age < 70), cor.test(age, friend_count, 
                                    method = "pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.326, df = 90664, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1775257 -0.1648889
## sample estimates:
##        cor 
## -0.1712144

1.0.13 Correlation Methods

Notes: Correlation Methods: Pearson’s r, Spearman’s rho, and Kendall’s tau
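
The difference between these measures shows up on data that is monotonic but not linear. The sketch below uses made-up values: Pearson’s r measures the strength of a linear relationship, while Spearman’s rho and Kendall’s tau are rank-based and only look at whether the ordering is preserved.

```r
# A perfectly monotonic but strongly curved relationship.
x <- 1:10
y <- x^3

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
kendall  <- cor(x, y, method = "kendall")

# Pearson is below 1 because the points do not fall on a line;
# the rank-based measures are 1 because y always increases with x.
print(c(pearson = pearson, spearman = spearman, kendall = kendall))
```

The same `method` argument can be passed to `cor.test()`, as in the calls above.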


1.1 Create Scatterplots

Notes:

# Create a scatterplot of likes_received (y)
# vs. www_likes_received (x). Use any of the
# techniques that you've learned so far to
# modify the plot.

ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
  geom_point()


1.1.1 Strong Correlations

Notes:

ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
  geom_point() +
  coord_cartesian(xlim = c(0, quantile(pf$www_likes_received, 0.95)),
                  ylim = c(0, quantile(pf$likes_received, 0.95))) +
  geom_smooth(method = "lm", color = "red")

What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

cor.test(pf$www_likes_received, pf$likes_received)
## 
##  Pearson's product-moment correlation
## 
## data:  pf$www_likes_received and pf$likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9473553 0.9486176
## sample estimates:
##       cor 
## 0.9479902

Response: 0.948. One of the variables is really a superset of the other.


1.1.2 Moira on Correlation

Notes: Typically, when I’m working on a problem, I’m going to be doing some kind of regression where I’m modeling something.

I’m going to be throwing some of these variables into the regression, and one of the assumptions of regression is that these variables are independent of each other.

And so if any two are too highly correlated with each other, it will be really difficult to tell which ones are actually driving the phenomenon.

And so it’s important to measure the correlation between your variables first, often because it’ll help you determine which ones you don’t actually want to throw in together, and it might help you decide which ones you actually want to keep.
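
One way to act on this advice is to look at the pairwise correlations of the candidate variables before fitting anything. The sketch below uses a made-up data frame; the column names mirror the likes variables discussed above, the `mobile_likes_received` column and all values are assumptions for illustration, and the total is constructed as the sum of the two parts.

```r
# Made-up candidate predictors; the total is built from its parts,
# so it is bound to be highly correlated with them.
toy <- data.frame(
  www_likes_received    = c(0, 2, 5, 9, 14, 20),
  mobile_likes_received = c(1, 0, 3, 2, 6, 4)
)
toy$likes_received <- toy$www_likes_received + toy$mobile_likes_received

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(toy)
print(round(cor_matrix, 3))
```

A pair with a correlation close to 1, like the part and its total here, is a candidate for keeping only one of the two variables in the model.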


1.1.3 More Caution with Correlation

Notes:

#install.packages('alr3')
library(alr3)
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
data("Mitchell")
?Mitchell

Create your plot!

# Create a scatterplot of temperature (Temp)
# vs. months (Month).

ggplot(data = Mitchell, aes(x = Month, y = Temp)) +
  geom_point()

qplot(data = Mitchell, Month, Temp)


1.1.4 Noisy Scatterplots

  1. Take a guess for the correlation coefficient for the scatterplot. 0
  2. What is the actual correlation of the two variables? (Round to the thousandths place) 0.057
cor.test(Mitchell$Month, Mitchell$Temp)
## 
##  Pearson's product-moment correlation
## 
## data:  Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

1.1.5 Making Sense of Data

Notes:

ggplot(data = Mitchell, aes(x = Month, y = Temp)) +
  geom_point() +
  scale_x_continuous(breaks = seq(0, 203, 12))


1.1.6 A New Perspective

What do you notice? Response: a cyclical pattern, like a sine or cosine graph

Watch the solution video and check out the Instructor Notes! Notes:

ggplot(data = Mitchell, aes(x = (Month%%12), y = Temp)) +
  geom_point()
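
The effect of folding the months with `%%` can be sketched on synthetic data. The series below is an idealized, noise-free seasonal curve over four years; it is an assumption made for illustration and is not the Mitchell data.

```r
# Four years of an idealized seasonal temperature curve.
month <- 0:47
temp <- 15 - 10 * cos(2 * pi * month / 12)

# Over whole cycles, the linear correlation with time is close to zero.
overall_cor <- cor(month, temp)

# Folding onto month-of-year lines the cycles up on top of each other.
month_of_year <- month %% 12
spread_within_month <- tapply(temp, month_of_year, function(v) diff(range(v)))

print(overall_cor)
print(max(spread_within_month))
```

A correlation near zero here says only that there is no linear trend; the strong seasonal structure becomes visible once the x-axis matches the length of the cycle.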


1.1.7 Understanding Noise: Age to Age Months

Notes:

# Create a new variable, 'age_with_months', in the 'pf' data frame.
# Be sure to save the variable in the data frame rather than creating
# a separate, stand-alone variable. You will need to use the variables
# 'age' and 'dob_month' to create the variable 'age_with_months'.

# Assume the reference date for calculating age is December 31, 2013.

pf$age_with_months <- pf$age + (12 - pf$dob_month) / 12

pf$age_with_months <- pf$age + (1 - pf$dob_month / 12)

pf$age_with_months <- with(pf, age + (1 - dob_month / 12))
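
A small worked example shows what the formula does, using made-up users who are all 36 years old on the reference date. The later in the year someone was born, the less time has passed since their last birthday.

```r
# Three hypothetical users, all aged 36, born in January, March, and December.
age <- c(36, 36, 36)
dob_month <- c(1, 3, 12)

# Both forms used above are algebraically the same.
age_with_months   <- age + (12 - dob_month) / 12
age_with_months_2 <- age + (1 - dob_month / 12)

# January birthday: 36 + 11/12; March: 36.75; December: exactly 36.
print(age_with_months)
```

Because the reference date is December 31, a December birthday adds nothing and a January birthday adds eleven twelfths of a year.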

1.1.8 Age with Months Means

# Create a new data frame called
# pf.fc_by_age_months that contains
# the mean friend count, the median friend
# count, and the number of users in each
# group of age_with_months. The rows of the
# data frame should be arranged in increasing
# order by the age_with_months variable.

# For example, the first two rows of the resulting
# data frame would look something like...

# age_with_months  friend_count_mean    friend_count_median n
#              13            275.0000                   275 2
#        13.25000            133.2000                   101 11


# See the Instructor Notes for two hints if you get stuck.
# This programming assignment will automatically be graded.

pf <- read.delim('pseudo_facebook.tsv')
pf$age_with_months <- pf$age + (1 - pf$dob_month / 12)
suppressMessages(library(dplyr))

Programming Assignment

pf.fc_by_age_months <- pf %>%
  group_by(age_with_months) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age_with_months)
head(pf.fc_by_age_months)
## # A tibble: 6 x 4
##   age_with_months friend_count_mean friend_count_median     n
##             <dbl>             <dbl>               <dbl> <int>
## 1            13.2              46.3                30.5     6
## 2            13.2             115.                 23.5    14
## 3            13.3             136.                 44      25
## 4            13.4             164.                 72      33
## 5            13.5             131.                 66      45
## 6            13.6             157.                 64      54
age_with_months_groups <- group_by(pf, age_with_months)
pf.fc_by_age_months2 <- summarise(age_with_months_groups,
                                  friend_count_mean = mean(friend_count),
                                  friend_count_median = median(friend_count),
                                  n = n())
pf.fc_by_age_months2 <- arrange(pf.fc_by_age_months2, age_with_months)
head(pf.fc_by_age_months2)
## # A tibble: 6 x 4
##   age_with_months friend_count_mean friend_count_median     n
##             <dbl>             <dbl>               <dbl> <int>
## 1            13.2              46.3                30.5     6
## 2            13.2             115.                 23.5    14
## 3            13.3             136.                 44      25
## 4            13.4             164.                 72      33
## 5            13.5             131.                 66      45
## 6            13.6             157.                 64      54

1.1.9 Noise in Conditional Means

# Create a new line plot showing friend_count_mean versus the new variable,
# age_with_months. Be sure to use the correct data frame (the one you created
# in the last exercise) AND subset the data to investigate users with ages less
# than 71.
ggplot(data = subset(pf.fc_by_age_months, age_with_months < 71), 
       aes(x = age_with_months, y = friend_count_mean)) +
  geom_line()


1.1.10 Smoothing Conditional Means

Notes:

p1 <- ggplot(data = subset(pf.fc_by_age, age < 71),
       aes(x = age, y = friend_count_mean)) +
  geom_line() +
  geom_smooth()

p2 <- ggplot(data = subset(pf.fc_by_age_months, age_with_months < 71), 
       aes(x = age_with_months, y = friend_count_mean)) +
  geom_line() +
  geom_smooth()

p3 <- ggplot(data = subset(pf.fc_by_age, age < 71),
       aes(x = round(age / 5) * 5, y = friend_count_mean)) +
  geom_line(stat = "summary", fun.y = mean)

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(p2, p1, p3, ncol = 1)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
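
The third plot groups ages into five-year bins with `round(age / 5) * 5`. A quick sketch on a handful of ages shows which ages end up in which bin.

```r
# Ten consecutive ages.
age <- 13:22

# Round each age to the nearest multiple of 5.
age_bin <- round(age / 5) * 5

# Ages 13 to 17 map to 15, and ages 18 to 22 map to 20.
print(data.frame(age = age, age_bin = age_bin))
```

Wider bins average over more users, which gives a smoother line but hides detail such as the spike at age 69.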


1.1.11 Which Plot to Choose?

Notes: One important answer is that you don’t have to choose. In exploratory data analysis, we’ll often create multiple visualizations and summaries of the same data, gleaning different insights from each.


1.1.12 Analyzing Two Variables

Reflection: We covered scatter plots, conditional means, and correlation coefficients.

We learned how to explore the relationship between two variables. In particular, we:

  1. Augmented the scatter plot with conditional summaries, like means.
  2. Learned about the benefits and the limitations of using correlation.
  3. Learned how correlation may affect decisions about which variables to include in final models, and how to make sense of data by adjusting our visualizations.
  4. Learned not to necessarily trust our interpretation of initial scatter plots, as with the seasonal temperature data.
  5. Learned how to use jitter and transparency to reduce overplotting.

Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!